What is data profiling?
Data profiling is the process of extracting information about data. Given tabular data (think of an Excel spreadsheet), we commonly want to extract the following properties about each column:
- Number of rows
- Number of cells without data
- Number of cells with a value of zero
- Number of distinct/unique values
- Number of duplicate rows
- Minimum, mean, median, maximum, quantiles, range, standard deviation, variance, sum
- Values distribution
- Most common values
- Examples of values
The process of data profiling allows a data scientist or engineer to quickly identify potential sources of problems in the data, such as:
- Negative numbers when numbers should all be positive
- Missing values, which may need to be imputed or whose rows may have to be removed
- Issues with the distribution of values such as class imbalance if we plan to solve a classification problem
In an ideal situation, data profiling reports:
- No missing cells, so you neither have to ask whether the data can be filled in nor impute it using assumptions
- Properly normalized data (e.g., values stored separately from their units), so the data can be used as-is; otherwise you need to transform the column to extract the numeric value from the unit
- All the data in a column using the same unit, unless otherwise specified (e.g., you do not want meters, centimeters, feet, and inches mixed in the same column), so your data is consistent; otherwise you need to identify the scales/units used and convert the data to a common unit
- Little to no row duplication, so you know that your data was collected without creating duplicate entries, which sometimes happens when databases are merged manually to create a data file; otherwise you may have to drop the duplicate rows or decide how many of the duplicates should be kept
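These per-column properties are straightforward to compute. As a sketch, here is a minimal pure-Python profiler using only the standard library (tools like pandas-profiling do this and much more); it assumes `None` marks a missing cell, which real datasets may instead encode as NaN or empty strings:

```python
import statistics

def profile_column(values):
    """Profile a single column: row count, missing cells, zeros,
    distinct values, and basic summary statistics.

    `values` is a list where None marks a missing cell (an assumption
    of this sketch)."""
    present = [v for v in values if v is not None]
    profile = {
        "rows": len(values),
        "missing": len(values) - len(present),
        "zeros": sum(1 for v in present if v == 0),
        "distinct": len(set(present)),
    }
    if present:  # statistics are undefined on an empty column
        profile.update({
            "min": min(present),
            "mean": statistics.mean(present),
            "median": statistics.median(present),
            "max": max(present),
        })
    return profile

# Example: one numeric column with a missing cell and a zero
print(profile_column([5, 0, None, 5, 10]))
```

The same idea extends to the other properties listed above (quantiles via `statistics.quantiles`, most common values via `collections.Counter`, and so on).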
Time series forecasting projects
What are the general steps of a time series forecasting project?
Using a tool such as pandas-profiling, the dataset provided by the client is profiled and a variety of summary statistics are computed for numerical columns, such as the min, mean, median, max, quartiles, number of samples, number of zeros, and number of missing values. Other types of data also have their own set of properties computed.
These summary statistics allow you to quickly have a glance at the data. You will want to look for missing values to assess whether there's a problem with the provided data. Sometimes missing data can imply that you should use the prior value that was set. Sometimes it means that the data isn't available, which can be an issue and may require you to do some form of data imputation down the road.
Common things to look for in time series data are:
- Gaps in the data (periods where no data has been recorded)
- The trend/seasonality/residual decomposition per time series
- The autocorrelation and partial autocorrelation plots
- The distribution of values grouped by a certain period (by month, by week, by day, by day of the week, by hour)
- Line/scatter plots of values grouped by the same periods
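Gaps are one of the easiest of these checks to automate. A minimal sketch, assuming integer timestamps and a known expected sampling interval `step`:

```python
def find_gaps(timestamps, step=1):
    """Return (start, end) pairs where consecutive timestamps are more
    than `step` apart, i.e. periods with no recorded data. `step` is
    the expected sampling interval (an assumption of this sketch;
    adjust it to your series' frequency)."""
    return [(prev, curr)
            for prev, curr in zip(timestamps, timestamps[1:])
            if curr - prev > step]

print(find_gaps([1, 2, 3, 7, 8, 12]))  # [(3, 7), (8, 12)]
```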
Data is rarely clean and ready to be consumed. Cleaning can mean many things: removing invalid values, converting values that are out of range into a valid range, splitting cells that hold multiple values into separate cells (e.g., "10 cm" split into "10" and "cm").
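As an illustration of the last point, a naive value/unit splitter might look like this; it assumes cells follow a strict "&lt;number&gt; &lt;unit&gt;" format, and real data would need more robust parsing (regex, unit whitelists, error handling):

```python
def split_value_unit(cell):
    """Split a cell like '10 cm' into a numeric value and its unit.
    Assumes a strict '<number> <unit>' format (an assumption of this
    sketch)."""
    value, unit = cell.split(" ", 1)
    return float(value), unit

print(split_value_unit("10 cm"))  # (10.0, 'cm')
```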
A variety of transformations can be applied to the cleaned data, ranging from data imputation (setting values where values are missing using available data), applying a function on the data, such as power, log or square root transform, differencing (computing the difference with the prior value), going from time zoned date time to timestamps, etc.
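Two of these transformations, the log transform and first-order differencing, can be sketched in a few lines (the log transform assumes strictly positive values):

```python
import math

def log_transform(series):
    """Log transform; assumes strictly positive values."""
    return [math.log(v) for v in series]

def difference(series):
    """First-order differencing: each value minus the prior value.
    The result is one element shorter than the input."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

print(difference([10, 12, 15, 14]))  # [2, 3, -1]
```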
Common feature generation transformations are applied, such as computing lagged values on variables, moving averages/median, exponential moving averages, extracting the latest min/max, counting the number of peaks encountered so far, etc. Feature generation is where you create additional information for your model to consume with the hope that it will provide it some signal it can make use of.
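Lagged values and moving averages, two of the feature generation transformations mentioned above, can be sketched as follows (the `None` padding marking positions where the feature is undefined is an assumption of this sketch; libraries often use NaN instead):

```python
def lag(series, k):
    """Shift the series forward by k steps; the first k positions have
    no lagged value and are marked None."""
    return [None] * k + series[:len(series) - k]

def moving_average(series, window):
    """Trailing moving average; None until a full window is available."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[i + 1 - window:i + 1]) / window)
    return out

series = [1, 2, 3, 4]
print(lag(series, 1))             # [None, 1, 2, 3]
print(moving_average(series, 2))  # [None, 1.5, 2.5, 3.5]
```

Only past values feed each feature, which is essential in forecasting: a feature that peeks at future values would leak information the model will not have in production.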
Before attempting to find a good model for the problem at hand you want to start with simple/naive models. The time series naive model simply predicts the future by using the latest value as its prediction.
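The naive model described above fits in one line:

```python
def naive_forecast(history, horizon=1):
    """Naive model: predict the latest observed value, repeated for
    each step of the forecast horizon."""
    return [history[-1]] * horizon

print(naive_forecast([3, 5, 8], horizon=2))  # [8, 8]
```

Despite its simplicity, it is a surprisingly strong baseline on many series, and any candidate model should beat it before being considered further.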
With a baseline established, you can now run a variety of experiments, which generally means trying different models on the same dataset while evaluating them the same way (same training/validation splits). In time series, we do cross-validation by creating a train/validation split where the validation split (i.e., the samples in the validation set) occurs temporally after the training split. The cross-validation split represents different points in time at which the models are trained and evaluated for their performance.
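A sketch of such a temporal cross-validation split, assuming a fixed validation horizon and an expanding training window (one possible scheme among several):

```python
def time_series_splits(n, n_splits, horizon):
    """Build (train_indices, validation_indices) pairs for a series of
    length n. Each validation window of length `horizon` starts right
    after its (expanding) training window, so validation data always
    occurs temporally after training data."""
    splits = []
    for i in range(n_splits):
        val_end = n - i * horizon
        val_start = val_end - horizon
        if val_start <= 0:  # not enough history left to train on
            break
        splits.append((list(range(val_start)),
                       list(range(val_start, val_end))))
    return splits[::-1]  # earliest split first

for train, val in time_series_splits(10, n_splits=3, horizon=2):
    print(len(train), val)
```

scikit-learn's `TimeSeriesSplit` implements the same idea with more options.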
After you've completed a few experiments you'll have a variety of results to analyze. You will want to look at your primary performance metric, which is generally defined as an error metric you are trying to minimize. Examples of error metrics are MAE, MSE, RMSE, MAPE, SMAPE, WAPE, MASE. Performance is evaluated on your validation data (out-of-sample) and lets you have an idea of how the model will perform on data it hasn't seen during training, which closely replicates the situation you will encounter in production.
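A few of these error metrics are simple enough to define directly (MAPE assumes no actual value is zero):

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return (sum((a - p) ** 2
                for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

def mape(actual, predicted):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return sum(abs((a - p) / a)
               for a, p in zip(actual, predicted)) / len(actual)

actual, predicted = [10, 20, 30], [12, 18, 33]
print(round(mae(actual, predicted), 3))  # 2.333
```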
With many models and their respective primary metric computed, you can pick the one which has produced the lowest error across the cross-validation train/validation splits.
Once the model has been selected, it is packaged to be deployed. This generally implies something as simple as pickling the model object and loading it in the remote environment so it can be used to do predictions.
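The pickling step can be as short as this (a dictionary stands in for the fitted model object here; any picklable estimator works the same way):

```python
import pickle

# Stand-in "model": in a real project this would be the fitted
# estimator object rather than a plain dictionary.
model = {"kind": "naive", "last_value": 42}

payload = pickle.dumps(model)     # serialize for deployment
restored = pickle.loads(payload)  # load in the remote environment
print(restored["last_value"])     # 42
```

In practice the payload is written to a file or artifact store, and the remote environment must have compatible versions of the libraries the model object depends on.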
There are two modes of forecasting:
- Offline: Data used for forecasting is collected during a period of time and then a scheduled task uses this newly available data to create new forecasts. This is generally used for systems with large amounts of data where the forecasts are not needed in real-time, such as forecasting tomorrow's stock price, the minimum and maximum temperature, the volume of stocks that will be sold during the week, etc.
- Online: Data used for forecasting is given to the model and predictions are expected to be returned within a short time frame, on the order of less than a second to a minute.
Raw data is transformed and feature engineered, then given to the model to produce forecasts.
Data used to do time series forecasting
What data do I need to do time series forecasting?
There are three values that you must know for each data point of your time series:
- its entity, which represents a unique value identifying the time series (e.g., a product SKU). Without this information, it is not possible to construct a sequence of points since there's no logical grouping between the points.
- its timestamp, which represents the moment in time the data point was recorded. Without this information, it is not possible to construct a sequence of points since there's no sequential ordering between the points.
- its target, which represents the measurement of the data point itself that we want to predict. Without this information, we have effectively nothing to base ourselves on.
Such information would look as follows when organized in a table:
Entity | Timestamp | Target |
---|---|---|
A | 1 | 5 |
A | 2 | 6 |
A | 3 | 7 |
B | 1 | 13 |
B | 2 | 27 |
B | 3 | 55 |
Additionally, you may have recorded other values at the same time, which can be a useful source of information when trying to predict a time series.
Entity | Timestamp | Target | Value 1 |
---|---|---|---|
A | 1 | 5 | 3 |
A | 2 | 6 | 2 |
A | 3 | 7 | 1 |
B | 1 | 13 | 47 |
B | 2 | 27 | 33 |
B | 3 | 55 | 5 |
Let's see what happens when we remove each of these columns to illustrate their necessity.
Timestamp | Target |
---|---|
1 | 5 |
2 | 6 |
3 | 7 |
1 | 13 |
2 | 27 |
3 | 55 |
Removing the entity effectively leaves us with two values for the same timestamp. If the data was in this format and we were told that each time the timestamp goes below its previous value a new entity was defined, we would be able to reconstruct the initial table with its entity column.
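The reconstruction rule just described can be sketched as follows (entity labels 'A', 'B', ... are generated in order, which assumes fewer than 26 entities):

```python
def reconstruct_entities(timestamps):
    """Assign a new entity label each time a timestamp drops below its
    predecessor, per the reconstruction rule described above. Assumes
    fewer than 26 entities (labels 'A'..'Z')."""
    labels = []
    entity = 0
    prev = None
    for t in timestamps:
        if prev is not None and t < prev:
            entity += 1  # timestamp reset: a new entity starts here
        labels.append(chr(ord("A") + entity))
        prev = t
    return labels

print(reconstruct_entities([1, 2, 3, 1, 2, 3]))
# ['A', 'A', 'A', 'B', 'B', 'B']
```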
Entity | Target |
---|---|
A | 5 |
A | 6 |
A | 7 |
B | 13 |
B | 27 |
B | 55 |
Removing the timestamp gives us the values the entity may take, but we don't know when. Again, if we're told that the rows have been kept in some order, we could reconstruct the timestamp column.
Entity | Timestamp |
---|---|
A | 1 |
A | 2 |
A | 3 |
B | 1 |
B | 2 |
B | 3 |
Removing the target column makes this problem impossible to solve. We're left with only the entities that were measured and the time of measurement, but no measurement, which makes the two other values useless.